Method and system for facilitating predicting the three-dimensional structure of an amino acid sequence

ABSTRACT

One embodiment of the subject matter can facilitate three-dimensional protein structure prediction from an amino acid sequence based on dynamic programming and a probabilistic model learned from training examples. This embodiment is efficient, accurate, can easily be parallelized, and guarantees a prediction that is a most likely three-dimensional protein structure for the amino acid sequence. Moreover, this embodiment is rotation invariant and does not require physical or biological knowledge to determine a protein's three-dimensional configuration based on a corresponding amino acid sequence.

BACKGROUND Field

The subject matter relates to predicting the three-dimensional structure of an amino acid sequence. Amino acids are the building blocks of proteins. Currently, there are twenty-one proteinogenic amino acids in humans: alanine, arginine, asparagine, aspartic acid, cysteine, glutamine, glutamic acid, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, selenocysteine, serine, threonine, tryptophan, tyrosine, and valine. Proteins play numerous critical roles in the body, functioning as enzymes, structural components, antibodies, messengers, and in the transport of atoms and small molecules within cells and throughout the body.

Related Art

In accepting the 1972 Nobel Prize in Chemistry, Christian Anfinsen famously said that a protein's amino acid sequence should fully determine its three-dimensional structure. This statement has sparked nearly five decades of research aimed at predicting a protein's three-dimensional structure based on its amino acid sequence.

This prediction problem is difficult to solve because a typical protein can fold into 10³⁰⁰ possible three-dimensional structures. Even running the fastest current computer since the Big Bang, the enumeration of all possible three-dimensional structures for a typical protein would just be getting started. In contrast, proteins in nature spontaneously fold into their three-dimensional structure in milliseconds. Not only have no computational methods been developed that can predict a three-dimensional structure that quickly, but even our best current prediction methods are relatively inaccurate, especially when no homologous structures exist.

Roughly 200 million proteins (amino acid sequences) are known, but the three-dimensional structure is known for only a tiny fraction of these. The lack of three-dimensional structures for proteins is because the direct method, X-ray analysis, is expensive and difficult to apply to a large number of proteins. An efficient way to predict the three-dimensional structure of proteins based on the known three-dimensional structures of proteins in a repository might be a solution to this bottleneck. In particular, the repositories SWISS-MODEL, Genome3D, and ModBase provide free access to a large number of protein structures. One promising avenue is to machine learn a three-dimensional structure prediction model based on training data comprising amino acid sequences for which a three-dimensional structure is known.

Most recently, AlphaFold became the first machine learning prediction model that could predict a protein's three-dimensional structure with relatively high accuracy, even when no homologous structures exist. AlphaFold combines deep learning neural networks with physical and biological knowledge about protein structure. AlphaFold operates in two stages. First, the trunk of the network processes the inputs through repeated neural network layers. Second, the trunk of the network is followed by the structure module that introduces a three-dimensional structure in the form of a rotation and translation for each residue of the protein.

Although AlphaFold is currently the most accurate three-dimensional structure prediction system, it still has not reached the Christian Anfinsen ideal of determining a protein's three-dimensional structure purely based on the amino acid sequence: it incorporates physical and biological knowledge, both of which are fixed and not learned from data. AlphaFold can also require enormous computational resources and significant hyper-parameter tuning that might not facilitate scaling to larger proteins.

Hence, what is needed is a method and a system for three-dimensional protein structure prediction based on the corresponding amino acid sequence that does not require physical and biological knowledge and that is more efficient and scalable.

SUMMARY

One embodiment of the subject matter can facilitate three-dimensional protein structure prediction from an amino acid sequence based on dynamic programming and a probabilistic model learned from training examples. This embodiment has several advantages. First, it is more efficient than previous methods for predicting three-dimensional protein structure from an amino acid sequence. This efficiency is both in prediction time and learning time. Prediction time is linear in the number of elements in the amino acid sequence. For a non-parallel version, learning time is linear in the average number of elements in the amino acid sequences times the number of training examples. Embodiments of the subject matter can also be parallelized over the training data to yield a learning time that is linear in the maximum number of elements over all amino acid sequences in the training data.

Second, embodiments of the subject matter can fit three-dimensional structures of greater complexity because they are based on a non-linear model and because they propagate local information globally, throughout the sequence. Third, an embodiment is optimal in that it guarantees a prediction that is a most likely three-dimensional protein structure from the amino acid sequence for the model. This guarantee is based on the optimality of dynamic programming and basic probability.

Fourth, it is general because it is rotation invariant. Fifth, it does not require physical or biological knowledge to determine a protein's three-dimensional configuration based on a corresponding amino acid sequence. Sixth, the framework is simpler to implement than that of Deep Learning methods such as in AlphaFold: no complex hyperparameters to tune, no complex neural network structures to determine, and no specialized hardware required for speed. Seventh, embodiments of the subject matter facilitate greater opportunities for parallelism: across training examples and random restarts for learning, and across subclasses for both prediction and learning.

The details of one or more embodiments of the subject matter are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 presents an example system for facilitating three-dimensional protein structure prediction.

In the FIGURES, like reference numerals refer to the same FIGURE elements.

DETAILED DESCRIPTION

A protein's three-dimensional configuration can be represented in multiple ways. In embodiments of the subject matter, this three-dimensional configuration is represented as a sequence of dihedral angles. More formally, in embodiments of the subject matter, an amino acid sequence comprises an m-length (m≥2) sequence of amino acids.

In embodiments of the subject matter, training data for an amino acid sequence comprises an (m−1)-length sequence of dihedral angles, each of which is between two contiguous amino acids. That is, there is one fewer dihedral angle than amino acids in the sequence.

Each dihedral angle comprises two angles, one for each plane: α and β. A sequence of such angles uniquely identifies the three-dimensional configuration of the amino acid sequence, regardless of rotation. The two angles between amino acids i and i+1 in the sequence can either use amino acid i for the origin or amino acid i+1 for the origin. However, the choice of origin must remain consistent across the entire sequence and across all training examples. A preferred embodiment of the subject matter assumes amino acid i for the origin and amino acid i+1 as the destination.

In embodiments of the subject matter, the prediction task is to determine the dihedral angles between successive amino acids in a given amino acid sequence. During operation, embodiments of the subject matter can execute the following procedure.

$\begin{aligned}
&\#\ \text{determine most likely states for each element in the sequence}\\
&s \in S:\\
&\quad t_{1,s} \leftarrow \mathcal{L}\left(\begin{bmatrix}x_{1_{\gamma}} \\ x_{2_{\gamma}} \\ o(s)\end{bmatrix},\ \mu_{\gamma:\tau},\ \Sigma_{\gamma:\tau,\gamma:\tau}\right)\\
&\quad g_{1,s} \leftarrow s\\
&\quad a_{1,s} \leftarrow \hat{\mu}\left(\begin{bmatrix}x_{1_{\gamma}} \\ x_{2_{\gamma}} \\ o(s)\end{bmatrix},\ \theta:\theta,\ \gamma:\tau,\ \mu,\ \Sigma\right)\\
&2 \le i \le m-1:\\
&\quad s \in S:\\
&\qquad t_{i,s} \leftarrow \max_{s' \in S}\left\{ l\left(\begin{bmatrix}x_{i_{\gamma}} \\ x_{i+1_{\gamma}} \\ o(s)\end{bmatrix} \,\middle|\, \begin{bmatrix}a_{i-1,s'} \\ x_{i-1_{\gamma}} \\ o(s')\end{bmatrix},\ \gamma:\tau,\ \theta':\tau',\ \dot{\mu},\ \dot{\Sigma}\right) + t_{i-1,s'} \right\}\\
&\qquad g_{i,s} \leftarrow \operatorname*{argmax}_{s' \in S}\left\{ l\left(\begin{bmatrix}x_{i_{\gamma}} \\ x_{i+1_{\gamma}} \\ o(s)\end{bmatrix} \,\middle|\, \begin{bmatrix}a_{i-1,s'} \\ x_{i-1_{\gamma}} \\ o(s')\end{bmatrix},\ \gamma:\tau,\ \theta':\tau',\ \dot{\mu},\ \dot{\Sigma}\right) + t_{i-1,s'} \right\}\\
&\qquad a_{i,s} \leftarrow \hat{\mu}\left(\begin{bmatrix}x_{i_{\gamma}} \\ x_{i+1_{\gamma}} \\ o(s) \\ a_{i-1,g_{i,s}} \\ x_{i-1_{\gamma}} \\ o(g_{i,s})\end{bmatrix},\ \theta:\theta,\ \gamma:\tau',\ \dot{\mu},\ \dot{\Sigma}\right)
\end{aligned}$

First, embodiments of the subject matter determine the most likely states for each element in the sequence. Here, S corresponds to a set of states. Typically, the set of states S={1 . . . k}, where k is a positive integer. States are like mixture components in a mixture model: they are merely identifiers that operate like a subclass in a model. More generally, the set of states can be any finite set of k elements such as {a,b,c,d}. During operation, embodiments of the subject matter treat the set {a,b,c,d} the same as the set {1,2,3,4}. Though the states have different labels, the number of states is the same, and hence these two different sets of states will be treated equivalently by embodiments of the subject matter. For convenience of implementation, a preferred embodiment of the subject matter uses states S={1 . . . k}, which is equivalent to any k-element set of labels in embodiments of the subject matter.

The expression s∈S: corresponds to a “for” loop that is executed for every state s∈S. For each state, for each element in the sequence, t_(i,s) stores a log maximum likelihood based on position i and state s. Similarly, g_(i,s) stores the predecessor state based on position i and state s. In contrast, a_(i,s) stores the most likely dihedral angles based on position i and state s. Embodiments of the subject matter can then use these stored values of t_(i,s), g_(i,s), and a_(i,s) to determine t, g, and a for larger values of i and other states based on dynamic programming, which will be described shortly.

The assignment

$t_{1,s} \leftarrow \mathcal{L}\left(\begin{bmatrix}x_{1_{\gamma}} \\ x_{2_{\gamma}} \\ o(s)\end{bmatrix},\ \mu_{\gamma:\tau},\ \Sigma_{\gamma:\tau,\gamma:\tau}\right)$

sets the first value of t for state s, where the values $x_{1_{\gamma}}$ and $x_{2_{\gamma}}$ correspond to one-hot versions of the amino acids at the first and second locations in the input x, for which the three-dimensional structure is to be determined, the function o(s) returns the one-hot representation of the state, and

$\mathcal{L}(x, \mu, \Sigma) = -\ln|\Sigma| - (x-\mu)^{T}\Sigma^{-1}(x-\mu)$.

The function $\mathcal{L}$ returns the ln (natural log) of the probability of x in a multivariate Gaussian distribution with mean μ and covariance matrix Σ, up to constants. (Constants such as π and ½ in the Gaussian density are removed here because they don't affect the outcome in embodiments of the subject matter; the sign is chosen so that larger values of $\mathcal{L}$ correspond to more likely x, consistent with the maximizations below.) Also, ln|Σ| is the natural log of the determinant of Σ, $M^{T}$ is the transpose of matrix M, and $\Sigma^{-1}$ is the inverse of a square matrix Σ.
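For illustration only, a minimal Python sketch of $\mathcal{L}$ might look as follows; the name `big_l` and the use of NumPy are illustrative assumptions, not part of the embodiments:

```python
import numpy as np

def big_l(x, mu, sigma):
    """Unnormalized Gaussian log likelihood L(x, mu, sigma).

    Proportional to ln p(x) under N(mu, sigma) with constant terms (pi, 1/2)
    dropped; larger return values correspond to more likely x, matching the
    maximizations in the prediction procedure.
    """
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)           # numerically stable ln|sigma|
    return -logdet - diff @ np.linalg.solve(sigma, diff)
```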

Embodiments of the subject matter can simultaneously leverage dynamic programming by using the state and position as an index to save precomputed results and by using a one-hot version of the state to leverage multivariate Gaussians, both in conditional and unconditional distributions. For example, t_(1,s) can be precomputed and stored for reuse through dynamic programming, and o(s), the one-hot version of s, can be used in a Gaussian distribution because it comprises continuous values (though it is represented as a vector of continuous values, one of which is always a 1 and the rest zeros).

As mentioned above, in embodiments of the subject matter, each amino acid is represented as a one-hot vector. For example, if there are 21 amino acids represented in alphabetical order, the one-hot vector for alanine (the first amino acid in alphabetical order) can be represented as a 21-row column vector with a one in the first position and zeroes elsewhere:

$\begin{bmatrix}1 \\0 \\ \vdots \\0\end{bmatrix}.$

Since an amino acid is not required for indexing, embodiments of the subject matter do not need two representations (original and one-hot) for amino acids. Only a one-hot is needed for input into a multivariate Gaussian distribution.

A one-hot representation is frequently used in machine learning to handle categorical data. In this representation, a k-category variable is converted to a k-length vector, where a 1 in location i of the k-length vector corresponds to the i^(th) category; the rest of the vector values are 0. For example, if the categories are A, B, and C, then a one-hot representation corresponds to a length-three vector where A can be represented as

$\begin{bmatrix}1 \\0 \\0\end{bmatrix},$

B as

$\begin{bmatrix}0 \\1 \\0\end{bmatrix},$

and C as

$\begin{bmatrix}0 \\0 \\1\end{bmatrix}.$

Other permutations of the vector can be used to equivalently represent the same three categorical variables.
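A minimal sketch of such an encoding in Python (the helper name `one_hot` is an illustrative assumption; a flat array stands in for the column vector shown above):

```python
import numpy as np

def one_hot(index, k):
    """Return a k-length one-hot vector with a 1 at `index` (0-based)."""
    v = np.zeros(k)
    v[index] = 1.0
    return v

print(one_hot(1, 3))   # category B of (A, B, C) -> [0. 1. 0.]
```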

The subscripts in a matrix refer to blocks of conformably partitioned matrices indexed by the associated symbol. For example, $\mu_{\theta}$ refers to that block of the mean vector corresponding to the dihedral angles mean, $\mu_{\gamma}$ refers to that block of the mean vector corresponding to an amino acid at a first location, $\mu_{\gamma^{+}}$ refers to that block of the mean vector corresponding to an amino acid at a next location, and $\mu_{\tau}$ refers to that block of the mean vector corresponding to the state.

The covariance matrix Σ is similarly conformably partitioned, as follows. The mean vector μ is conformably partitioned as

$\begin{bmatrix}\mu_{\theta} \\\mu_{\gamma} \\\mu_{\gamma^{+}} \\\mu_{\tau}\end{bmatrix},$

the covariance matrix Σ is conformably partitioned as

$\begin{bmatrix}\Sigma_{\theta,\theta} & \cdots & \Sigma_{\theta,\tau} \\ \vdots & \ddots & \vdots \\\Sigma_{\tau,\theta} & \cdots & \Sigma_{\tau,\tau}\end{bmatrix},$

the second mean vector $\dot{\mu}$ is conformably partitioned as

$\begin{bmatrix}{\overset{.}{\mu}}_{\theta} \\{\overset{.}{\mu}}_{\gamma} \\{\overset{.}{\mu}}_{\gamma^{+}} \\{\overset{.}{\mu}}_{\tau} \\{\overset{.}{\mu}}_{\theta^{\prime}} \\{\overset{.}{\mu}}_{\gamma^{\prime}} \\{\overset{.}{\mu}}_{\tau^{\prime}}\end{bmatrix},$

and the second covariance matrix $\dot{\Sigma}$ is conformably partitioned as

$\begin{bmatrix}{\overset{.}{\Sigma}}_{\theta,\theta} & \cdots & {\overset{.}{\Sigma}}_{\theta,\tau^{\prime}} \\ \vdots & \ddots & \vdots \\{\overset{.}{\Sigma}}_{\tau^{\prime},\theta} & \cdots & {\overset{.}{\Sigma}}_{\tau^{\prime},\tau^{\prime}}\end{bmatrix}.$

The prime (′) notation refers to an immediate predecessor in the sequence. For example, $\dot{\mu}_{\theta'}$ refers to that block of the $\dot{\mu}$ vector for a dihedral angle before the dihedral angle associated with $\dot{\mu}_{\theta}$. The plus (+) superscript notation refers to an immediate successor in the sequence.

The range notation a:b follows the order of variables that appear in μ, Σ, $\dot{\mu}$, and $\dot{\Sigma}$. For example, γ:τ′ specifies a range of blocks from γ to τ′: γ, γ⁺, τ, θ′, γ′, τ′. This range notation is merely a compact and succinct way to specify blocks of a conformably partitioned vector or matrix. The block order described here facilitates simpler notation of these ranges for the particular uses in embodiments of the subject matter.
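One hypothetical way to realize conformable blocks and the range notation in code is to keep one slice per block symbol; the layout, sizes, and names below are illustrative assumptions:

```python
def block_slices(sizes):
    """Map each named block, in order, to its slice of the partitioned vector."""
    slices, start = {}, 0
    for name, size in sizes.items():
        slices[name] = slice(start, start + size)
        start += size
    return slices

def block_range(slices, a, b):
    """Realize the range notation a:b as one contiguous slice."""
    return slice(slices[a].start, slices[b].stop)

# Undotted layout: theta (2 angles), gamma (21), gamma+ (21), tau (k = 4 here)
sl = block_slices({"theta": 2, "gamma": 21, "gamma+": 21, "tau": 4})
rng = block_range(sl, "gamma", "tau")   # covers gamma, gamma+, tau
```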

An advantage of a multivariate Gaussian distribution in the likelihood function $\mathcal{L}$ is that a missing value can simply be ignored. Only those blocks corresponding to known variables in the mean vector and covariance matrix are required to produce the same result as if marginalizing over missing variables.

The assignment g_(1,s)←s sets the first value of g to s for the state s. The assignment

$a_{1,s} \leftarrow \hat{\mu}\left(\begin{bmatrix}x_{1_{\gamma}} \\ x_{2_{\gamma}} \\ o(s)\end{bmatrix},\ \theta:\theta,\ \gamma:\tau,\ \mu,\ \Sigma\right)$

sets the first dihedral angles for a at state s to the most likely dihedral angles given $x_{1_{\gamma}}$, $x_{2_{\gamma}}$, and o(s). The value a_(1,s) is a two-row column vector corresponding to the most likely dihedral angle at the first location for state s (recall that the dihedral value corresponds to two angles, α and β).

Here, $\hat{\mu}(x, a, b, \mu, \Sigma) = \mu_{a} + \Sigma_{a,b}\Sigma_{b,b}^{-1}(x - \mu_{b})$, which is the conditional mean of a multivariate Gaussian distribution. In the function $\hat{\mu}$, the variable a corresponds to the block associated with the dihedral angles and the variable b corresponds to the block associated with the rest of the variables that are known at the time of dihedral angles prediction for this index and state.
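A sketch of $\hat{\mu}$ using the illustrative block slices above (the name `mu_hat` is an assumption, not part of the embodiments):

```python
import numpy as np

def mu_hat(x, a, b, mu, sigma):
    """Conditional mean of block `a` given that block `b` equals x:
    mu_a + Sigma_{a,b} Sigma_{b,b}^{-1} (x - mu_b)."""
    return mu[a] + sigma[a, b] @ np.linalg.solve(sigma[b, b], x - mu[b])
```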

The three initial assignments set the base values for t, g, and a, which will be used to set values for t, g, and a later in the sequence through dynamic programming. An alternative to the base values and μ and Σ is to include a dummy amino acid and dihedral angles at a dummy border (first location), and only use $\dot{\mu}$ and $\dot{\Sigma}$ and a “for” loop, which will be described shortly.

Although such dummy borders are common in image processing to reduce code, the problem with dummy borders is that a dummy state is required for those edges, as well as dummy values such as the amino acid and dihedral angles. Zeros are often used for such values associated with dummy borders, but this can bias the values of $\dot{\mu}$ and $\dot{\Sigma}$, especially if zeros are actual values in the rest of the sequence.

A disadvantage of using edge cases instead of borders is that, statistically, there are fewer edge cases than interior cases in the training data. For example, with n k-length sequences, there will only be n edge cases but n×k interior cases. In the spirit of greater clarity and potentially improved accuracy, the description of embodiments of the subject matter here avoids minor tricks such as dummy borders to reduce the amount of code.

The expression 2≤i≤m−1: corresponds to a “for” loop that loops through values of i from 2 to m−1, inclusive, where m is the number of elements in the amino acid sequence for which the dihedral angles are to be determined. The assignment

$t_{i,s} \leftarrow \max_{s' \in S}\left\{ l\left(\begin{bmatrix}x_{i_{\gamma}} \\ x_{i+1_{\gamma}} \\ o(s)\end{bmatrix} \,\middle|\, \begin{bmatrix}a_{i-1,s'} \\ x_{i-1_{\gamma}} \\ o(s')\end{bmatrix},\ \gamma:\tau,\ \theta':\tau',\ \dot{\mu},\ \dot{\Sigma}\right) + t_{i-1,s'} \right\}$

sets t_(i,s) to the log likelihood of the most likely state value s′ based on the right-hand side of the assignment, part of which has already been determined and stored in t_(i−1,s′), where $l(x \mid y, a, b, \mu, \Sigma) = \mathcal{L}(x,\ \hat{\mu}(y, a, b, \mu, \Sigma),\ \hat{\Sigma}(a, b, \Sigma))$ and $\hat{\Sigma}(a, b, \Sigma) = \Sigma_{a} - \Sigma_{a,b}\Sigma_{b,b}^{-1}\Sigma_{b,a}$. $\hat{\Sigma}(a, b, \Sigma)$ corresponds to the conditional covariance of a multivariate Gaussian. Together, the mean $\hat{\mu}(y, a, b, \dot{\mu}, \dot{\Sigma})$ and the covariance matrix $\hat{\Sigma}(a, b, \dot{\Sigma})$ define a multivariate Gaussian distribution.
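The conditional covariance and the conditional log likelihood l can be sketched the same way, reusing `big_l` and `mu_hat` from the earlier sketches (all names illustrative):

```python
import numpy as np

def sigma_hat(a, b, sigma):
    """Conditional covariance of block `a` given block `b`:
    Sigma_{a,a} - Sigma_{a,b} Sigma_{b,b}^{-1} Sigma_{b,a}."""
    return sigma[a, a] - sigma[a, b] @ np.linalg.solve(sigma[b, b], sigma[b, a])

def cond_l(x, y, a, b, mu, sigma):
    """l(x | y, a, b, mu, sigma): log likelihood of x under the Gaussian
    conditioned on block `b` taking the value y."""
    return big_l(x, mu_hat(y, a, b, mu, sigma), sigma_hat(a, b, sigma))
```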

The assignment

$g_{i,s} \leftarrow \operatorname*{argmax}_{s' \in S}\left\{ l\left(\begin{bmatrix}x_{i_{\gamma}} \\ x_{i+1_{\gamma}} \\ o(s)\end{bmatrix} \,\middle|\, \begin{bmatrix}a_{i-1,s'} \\ x_{i-1_{\gamma}} \\ o(s')\end{bmatrix},\ \gamma:\tau,\ \theta':\tau',\ \dot{\mu},\ \dot{\Sigma}\right) + t_{i-1,s'} \right\}$

records (saves in memory) the most likely predecessor state s′ for the previous assignment for t_(i,s). This recording will be used in the next assignment to determine the dihedral angles based on this state and other information, and in a backtrace method, which will be described.

The assignment

$a_{i,s} \leftarrow \hat{\mu}\left(\begin{bmatrix}x_{i_{\gamma}} \\ x_{i+1_{\gamma}} \\ o(s) \\ a_{i-1,g_{i,s}} \\ x_{i-1_{\gamma}} \\ o(g_{i,s})\end{bmatrix},\ \theta:\theta,\ \gamma:\tau',\ \dot{\mu},\ \dot{\Sigma}\right)$

sets a_(i,s) to the most likely dihedral angles given

$\begin{bmatrix}x_{i_{\gamma}} \\x_{i + 1_{\gamma}} \\{o(s)} \\a_{{i - 1},g_{i,s}} \\x_{i - 1_{\gamma}} \\{o\left( g_{i,s} \right)}\end{bmatrix}.$

Note that the quantities $a_{i-1,g_{i,s}}$ and g_(i,s) have previously been determined through dynamic programming.

The term dynamic programming as used by embodiments of the subject matter means that quantities precomputed earlier in the sequence can be used later in the sequence. Dynamic programming is efficient because of this re-use of precomputed data. More generally, dynamic programming can be used to solve an optimization problem by dividing it into simpler subproblems, where an optimal solution to the overall problem is based on optimal solutions to the simpler subproblems. In embodiments of the subject matter, the optimization problem is maximization, and “simpler” corresponds to values that have been precomputed earlier in the sequence. For example, all three of t_(i,s), g_(i,s), and a_(i,s) have been determined with dynamic programming because they are all based on previously determined values from earlier in the sequence. In probabilistic terms, a_(i,s) is also determined with optimization, though in closed form, because the function $\hat{\mu}$ returns the most likely value, which is the mean of the conditional multivariate Gaussian distribution.
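Putting the pieces together, the forward pass of the prediction procedure might be sketched as follows. This is a minimal illustration assuming the helpers above (`one_hot`, `big_l`, `mu_hat`, `cond_l`, `block_slices`, `block_range`), 21 amino acid categories, and 0-based indexing in place of the text's 1-based indexing:

```python
import numpy as np

N_AA = 21  # number of amino acid categories (an illustrative assumption)

def forward_pass(x, k, mu, sigma, mu_dot, sigma_dot):
    """Compute t (log likelihoods), g (predecessor states), and a (most likely
    dihedral angle pairs) for a sequence x of amino acid indices and k states."""
    sl = block_slices({"theta": 2, "gamma": N_AA, "gamma+": N_AA, "tau": k})
    sld = block_slices({"theta": 2, "gamma": N_AA, "gamma+": N_AA, "tau": k,
                        "theta'": 2, "gamma'": N_AA, "tau'": k})
    m = len(x)
    xh = [one_hot(aa, N_AA) for aa in x]               # one-hot amino acids
    t = np.full((m - 1, k), -np.inf)
    g = np.zeros((m - 1, k), dtype=int)
    a = np.zeros((m - 1, k, 2))

    b0 = block_range(sl, "gamma", "tau")               # known blocks at the border
    for s in range(k):
        obs = np.concatenate([xh[0], xh[1], one_hot(s, k)])
        t[0, s] = big_l(obs, mu[b0], sigma[b0, b0])
        g[0, s] = s
        a[0, s] = mu_hat(obs, sl["theta"], b0, mu, sigma)

    cur = block_range(sld, "gamma", "tau")             # current-position blocks
    prev = block_range(sld, "theta'", "tau'")          # predecessor blocks
    for i in range(1, m - 1):
        for s in range(k):
            obs = np.concatenate([xh[i], xh[i + 1], one_hot(s, k)])
            scores = [cond_l(obs,
                             np.concatenate([a[i - 1, sp], xh[i - 1], one_hot(sp, k)]),
                             cur, prev, mu_dot, sigma_dot) + t[i - 1, sp]
                      for sp in range(k)]
            g[i, s] = int(np.argmax(scores))
            t[i, s] = scores[g[i, s]]
            known = np.concatenate([obs, a[i - 1, g[i, s]], xh[i - 1],
                                    one_hot(g[i, s], k)])
            a[i, s] = mu_hat(known, sld["theta"],
                             block_range(sld, "gamma", "tau'"), mu_dot, sigma_dot)
    return t, g, a
```

As noted above, the inner loop over states can be parallelized across subclasses.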

Once the values for t, g, and a have been determined, embodiments of the subject matter can backtrace from high to low values of the sequence to find a most likely sequence of states for the amino acid sequence. The backtrace procedure is shown below.

$\begin{aligned}
&\#\ \text{backtrace to find most likely state sequence}\\
&r_{m-1} \leftarrow \operatorname*{argmax}_{s \in S}\left\{ t_{m-1,s} \right\}\\
&m-1 \ge i \ge 2:\ r_{i-1} \leftarrow g_{i,r_{i}}
\end{aligned}$

The backtrace assignments begin with the penultimate index value, m−1, in the sequence. Specifically, the assignment

$r_{m-1} \leftarrow \operatorname*{argmax}_{s \in S}\left\{ t_{m-1,s} \right\}$

stores the most likely state for position m−1 in a sequence of m amino acids.

Subsequently, $m-1 \ge i \ge 2:\ r_{i-1} \leftarrow g_{i,r_{i}}$ sets the values for the remaining positions as i runs from m−1 down to 2. Because the “for” loop runs from m−1 down to 2, the assignment is based on the previously set value r_(i). This is another use of dynamic programming in embodiments of the subject matter.

Once these most likely states are found for each index in the sequence, embodiments of the subject matter can determine the dihedral angles based on the most likely states and their associated dihedral angles.

# determine dihedral angles

$1 \le i \le m-1:\ h_{i} \leftarrow a_{i,r_{i}}$

The “for” loop $1 \le i \le m-1:\ h_{i} \leftarrow a_{i,r_{i}}$ determines the dihedral angles for each index. In this case, the loop proceeds from low to high because all values of $a_{i,r_{i}}$ have already been determined. This loop can be executed in parallel. After the loop completes, the dihedral angles for position i in the sequence are equal to h_(i). All of the dihedral angles in the sequence uniquely determine the three-dimensional structure of the protein corresponding to the amino acid sequence.
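A minimal sketch of the backtrace and angle-extraction loops, again with 0-based indexing and illustrative names:

```python
import numpy as np

def backtrace(t, g, a):
    """Recover the most likely state sequence r and its dihedral angles h."""
    n = t.shape[0]                   # n = m - 1 dihedral angle positions
    r = np.zeros(n, dtype=int)
    r[-1] = int(np.argmax(t[-1]))    # most likely state at the last position
    for i in range(n - 1, 0, -1):    # the text's loop m-1 >= i >= 2
        r[i - 1] = g[i, r[i]]
    h = a[np.arange(n), r]           # h_i <- a_{i, r_i}; trivially parallelizable
    return r, h
```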

Embodiments of the subject matter can execute the following steps to learn a prediction model based on a multivariate Gaussian distribution. The first step in learning the prediction model in embodiments of the subject matter randomly initializes the states for each sequence (training example), for each dihedral angle in the sequence. Here, n corresponds to the number of training examples and m_(j) corresponds to the number of elements in the sequence for training example j. The range for i is i≤m_(j)−1 because there is one fewer dihedral angle than amino acids in the sequence.

$\begin{aligned}
&\#\ \text{randomly initialize states}\\
&1 \le j \le n:\ 1 \le i \le m_{j}-1:\ r_{i}^{j} \leftarrow \mathrm{random}(S)
\end{aligned}$
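In Python, this initialization might look like the following sketch (`seq_lengths`, holding each m_(j), is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def init_states(seq_lengths, k):
    """One random state per dihedral angle: m_j - 1 states for sequence j."""
    return [rng.integers(0, k, size=m_j - 1) for m_j in seq_lengths]
```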

$\begin{aligned}
&\#\ \text{update model}\\
&data \leftarrow \varnothing\\
&\dot{data} \leftarrow \varnothing\\
&1 \le j \le n:\\
&\quad data.\mathrm{append}\left(\begin{bmatrix}x_{1_{\theta}}^{j} \\ x_{1_{\gamma}}^{j} \\ x_{2_{\gamma}}^{j} \\ o(r_{1}^{j})\end{bmatrix}\right)\\
&\quad 2 \le i \le m_{j}-1:\\
&\qquad \dot{data}.\mathrm{append}\left(\begin{bmatrix}x_{i_{\theta}}^{j} \\ x_{i_{\gamma}}^{j} \\ x_{i+1_{\gamma}}^{j} \\ o(r_{i}^{j}) \\ x_{i-1_{\theta}}^{j} \\ x_{i-1_{\gamma}}^{j} \\ o(r_{i-1}^{j})\end{bmatrix}\right)\\
&\mu \leftarrow data.\mathrm{mean}()\\
&\Sigma \leftarrow data.\mathrm{covariance}()\\
&\dot{\mu} \leftarrow \dot{data}.\mathrm{mean}()\\
&\dot{\Sigma} \leftarrow \dot{data}.\mathrm{covariance}()
\end{aligned}$

During operation, embodiments of the subject matter can execute the update model box above. The box describes two data stores, data and $\dot{data}$, both of which are initially set to empty (i.e., ∅). These data stores can correspond to sets, lists, arrays, or any other structure capable of storing and retrieving data. The first loop handles the edge cases for each training sequence (n is the number of training examples). The second loop handles the interior (non-edge) cases for each training sequence (m_(j) is the number of amino acids in the sequence associated with training example j). In either case, the append operation adds the corresponding example to the training data. The superscripts here are used to denote a particular training example. For example, $x_{i_{\theta}}^{j}$ corresponds to the i^(th) dihedral angle in an amino acid sequence from the j^(th) training example. Similarly, $r_{i}^{j}$ corresponds to the i^(th) state value in an amino acid sequence from the j^(th) training example.

Subsequently, when all data has been appended, embodiments of the subject matter can determine the mean vector and covariance matrix of each set of training data. Multiple ways can be used to determine these matrices, and, to prevent singularity, a small value can be added along the diagonal of each covariance matrix.
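One common way to compute these quantities, sketched in Python; the ridge value eps is an illustrative choice:

```python
import numpy as np

def fit_mean_cov(rows, eps=1e-6):
    """Mean vector and covariance matrix of the appended training rows.

    A small value eps added along the diagonal prevents a singular
    covariance matrix, as suggested in the text.
    """
    d = np.vstack(rows)
    mu = d.mean(axis=0)
    sigma = np.cov(d, rowvar=False) + eps * np.eye(d.shape[1])
    return mu, sigma
```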

Recall that the prediction model comprises two mean vectors and two covariance matrices. The undotted vector and matrix correspond to the edge cases for training: they are based on data at the first and second amino acids, the state at the first amino acid, and the first dihedral angle. The dotted vector and matrix correspond to the non-edge cases: they are based on data at all subsequent pairs of amino acids, states, and dihedral angles.

Embodiments of the subject matter can predict the most likely states for every pair of dihedral angles in a given sequence for each training example and then update the mean vectors and covariance matrices based on those most likely states until convergence. These steps are shown in the box below. After embodiments of the subject matter execute the update model box, the next few steps are similar to those in the prediction method in embodiments of the subject matter, except that the dihedral angles are known during learning.

Convergence can be defined in several ways. One way is with a fixed number of iterations of the above routine. Another way is until a difference of an aggregation of

$\max\limits_{s \in S}\left\{ t_{m_{j}-1,s} \right\}$

over all training examples 1≤j≤n between successive iterations is less than a given threshold. Aggregation functions include but are not limited to mean, max, min, and sum. A difference can be absolute or relative. Convergence can also be defined as reaching a local maximum in likelihood.
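For example, a relative-difference test on the mean aggregation might be sketched as follows (the threshold and the choice of mean are illustrative):

```python
import numpy as np

def converged(prev_scores, scores, tol=1e-4):
    """True when the relative change in the mean of the per-example maximum
    log likelihoods falls below tol between successive iterations."""
    prev_agg, agg = float(np.mean(prev_scores)), float(np.mean(scores))
    return abs(agg - prev_agg) <= tol * max(1.0, abs(prev_agg))
```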

$\begin{aligned}
&\#\ \text{repeat until convergence}\\
&\quad \text{update model}\\
&\quad 1 \le j \le n:\\
&\qquad s \in S:\\
&\qquad\quad t_{1,s} \leftarrow \mathcal{L}\left(\begin{bmatrix}x_{1_{\theta}}^{j} \\ x_{1_{\gamma}}^{j} \\ x_{2_{\gamma}}^{j} \\ o(s)\end{bmatrix},\ \mu,\ \Sigma\right)\\
&\qquad\quad g_{1,s} \leftarrow s\\
&\qquad 2 \le i \le m_{j}-1:\\
&\qquad\quad s \in S:\\
&\qquad\qquad t_{i,s} \leftarrow \max_{s' \in S}\left\{ l\left(\begin{bmatrix}x_{i_{\theta}}^{j} \\ x_{i_{\gamma}}^{j} \\ x_{i+1_{\gamma}}^{j} \\ o(s)\end{bmatrix} \,\middle|\, \begin{bmatrix}x_{i-1_{\theta}}^{j} \\ x_{i-1_{\gamma}}^{j} \\ o(s')\end{bmatrix},\ \theta:\tau,\ \theta':\tau',\ \dot{\mu},\ \dot{\Sigma}\right) + t_{i-1,s'} \right\}\\
&\qquad\qquad g_{i,s} \leftarrow \operatorname*{argmax}_{s' \in S}\left\{ l\left(\begin{bmatrix}x_{i_{\theta}}^{j} \\ x_{i_{\gamma}}^{j} \\ x_{i+1_{\gamma}}^{j} \\ o(s)\end{bmatrix} \,\middle|\, \begin{bmatrix}x_{i-1_{\theta}}^{j} \\ x_{i-1_{\gamma}}^{j} \\ o(s')\end{bmatrix},\ \theta:\tau,\ \theta':\tau',\ \dot{\mu},\ \dot{\Sigma}\right) + t_{i-1,s'} \right\}\\
&\qquad r_{m_{j}-1}^{j} \leftarrow \operatorname*{argmax}_{s \in S}\left\{ t_{m_{j}-1,s} \right\}\\
&\qquad m_{j}-1 \ge i \ge 2:\ r_{i-1}^{j} \leftarrow g_{i,r_{i}^{j}}
\end{aligned}$

Multiple random restarts, each with a different random state assignment, can improve the probability of finding a global maximum in the likelihood. These multiple random restarts can be run in parallel, and the model with the largest aggregation of

$\max\limits_{s \in S}\left\{ t_{m_{j}-1,s} \right\}$

over all training examples can be chosen as the best model. Alternatively, an ensemble of the top k models can be chosen.

Note that a mathematically equivalent version of the assignments for t and g can be defined in terms of a product of probabilities rather than a sum of logs of the probabilities. The product of probabilities can result in extremely low numbers, which can cause hardware underflow. Hence, in embodiments of the subject matter, the sum of the logs of probabilities is preferred for reasons of greater precision. Moreover, with this form, the multivariate Gaussian distribution simplifies so that no exponentials are required. Other mathematically equivalent routines can be used, as well as approximations of the multivariate Gaussian distribution.

An appropriate number of states (as in {1 . . . k}) can be determined in multiple different ways. For example, a validation set of sequences can be reserved and used to evaluate the likelihood of the sequences using an aggregation of

$\max\limits_{s \in S}\left\{ t_{m_{j}-1,s} \right\}.$

The number of states can be explored from 1 . . . k until a maximum in the likelihood is found (the peak method) or until the likelihood does not significantly increase (the elbow method). These methods are similar to those for finding an appropriate number of mixtures for a Gaussian mixture distribution.
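A sketch of the elbow method; `validation_score(k)` is assumed to train a k-state model and return its aggregated validation likelihood:

```python
def choose_num_states(validation_score, k_max, rel_tol=0.01):
    """Increase k until the held-out likelihood stops improving significantly."""
    best_k, best = 1, validation_score(1)
    for k in range(2, k_max + 1):
        score = validation_score(k)
        if score - best <= rel_tol * abs(best):
            break                    # no significant increase: the elbow
        best_k, best = k, score
    return best_k
```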

Embodiments of the subject matter spread local information globally in two ways. During prediction, both the state and the predicted dihedral angles at a location are used to predict both the state and the predicted dihedral angles at the subsequent location, the one immediately after. This propagation is guaranteed to be optimal even though the direction is from low to high location values. Hence, embodiments of the subject matter do not require repeated propagations as in deep learning. During learning, only the state information gets propagated because the dihedral angles are known. In both cases, local information is spread globally, but more efficiently than in deep learning.

Embodiments of the subject matter can also be generalized to a k^(th) order model between neighbors. For example, the formula to determine t_(i,s) can be based on the last two predecessors: s″, $x_{i-2_{\gamma}}$, and $a_{i-2,s''}$, where the maximization is over s″ and t can include an extra state s′, as in

$t_{i,s,s'} \leftarrow \max_{s'' \in S}\left\{ l\left(\begin{bmatrix}x_{i_{\gamma}} \\ x_{i+1_{\gamma}} \\ o(s)\end{bmatrix} \,\middle|\, \begin{bmatrix}a_{i-1,s'} \\ x_{i-1_{\gamma}} \\ o(s') \\ a_{i-2,s''} \\ x_{i-2_{\gamma}} \\ o(s'')\end{bmatrix},\ \gamma:\tau,\ \theta':\tau'',\ \dot{\mu},\ \dot{\Sigma}\right) + t_{i-1,s',s''} \right\}$

This example showed how embodiments of the subject matter can be extended to a second-order model. Extending embodiments of the subject matter to a k^(th) order model involves adding additional predecessor states to t_(i,s,s′) as in t_(i,s,s′,s″,s‴ . . .), adding predecessor data to the conditional part, including a larger selection of blocks in the conditional part, and shifting all the states over for the previously computed value of t. Theoretically, any higher-order model can be transformed into a first-order model by adding more and more states.

FIG. 1 shows an example three-dimensional structure prediction system 100 in accordance with an embodiment of the subject matter. Three-dimensional structure prediction system 100 is an example of a system implemented as a computer program on one or more computers in one or more locations (shown collectively as computer 102), with one or more storage devices (shown collectively as storage 108), in which the systems, components, and techniques described below can be implemented.

Three-dimensional structure prediction system 100 predicts the three-dimensional structure of an amino acid sequence. During operation, three-dimensional structure prediction system 100 determines, with angle-pair determining subsystem 110, a first pair of angles indexed by a first position in the sequence and a first state, based on a first amino acid indexed by the first position in the amino acid sequence, a second amino acid indexed by a second position in the amino acid sequence, the first state, a second pair of angles indexed by a third position and a second state, a third amino acid indexed by the third position in the amino acid sequence, and the second state. Here, the first position is in proximity to the second position, and the first position is in proximity to the third position. Moreover, the second pair of angles indexed by the third position and the second state has previously been determined by dynamic programming.

More specifically, angle-pair determining subsystem 110 determines a_(i,s), which corresponds to the first pair of angles, which are between $x_{i_{\gamma}}$ and $x_{i+1_{\gamma}}$, which correspond to the first amino acid and the second amino acid, respectively. The sequence of amino acids corresponds to x. The first state corresponds to s. The second state corresponds to g_(i,s), which has been determined with dynamic programming. The second pair of angles corresponds to $a_{i-1,g_{i,s}}$, which are between $x_{i_{\gamma}}$ and $x_{i-1_{\gamma}}$. Note that $a_{i-1,g_{i,s}}$, which corresponds to the second pair of angles indexed by the third position and the second state, has previously been determined by dynamic programming. Also note that g_(i,s) has itself been previously determined by dynamic programming.

The locations for these angles and amino acids are referenced by i, i−1, and i+1. Here, i refers to the first position, i+1 refers to the second position, and i−1 refers to the third position. In this example, i (the first position) is in proximity to i+1 (the second position). Also in this example, i (the first position) is in proximity to i−1 (the third position).

Subsequently, three-dimensional structure prediction system 100 returns a result indicating the three-dimensional structure based on the first pair of angles with three-dimensional structure return result indicating subsystem 120. Clearly, any three-dimensional structure will at least include this first pair of angles in addition to all the other pairs of angles for the rest of the sequence.

The preceding description is presented to enable any person skilled in the art to make and use the subject matter, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the subject matter. Thus, the subject matter is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.

Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, a data processing system.

A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code.

Alternatively, or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to a suitable receiver system for execution by a data processing system. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.

Computers suitable for the execution of a computer program include, by way of example, computers based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random-access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data.

A computer can also be distributed across multiple sites and interconnected by a communication network, executing one or more computer programs to perform functions by operating on input data and generating output.

A computer can also be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices.

The term “data processing system” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.

For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation causes the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by a data processing system, cause the system to perform the operations or actions.

The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. More generally, the processes and logic flows can also be performed by, and be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit), a dedicated or shared processor that executes a particular software module or a piece of code at a particular time, and/or other programmable-logic devices now known or later developed. When the hardware modules or system are activated, they perform the methods and processes included within them.

The system can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

The computer-readable storage medium includes, but is not limited to, volatile memory, non-volatile memory, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), DVDs (digital versatile discs or digital video discs), computer instruction signals embodied in a transmission medium (with or without a carrier wave upon which the signals are modulated), and other media capable of storing computer-readable media now known or later developed. For example, the transmission medium may include a communications network, such as a LAN, a WAN, or the Internet.

Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks.

The methods and processes described in the detailed description section can be embodied as code and/or data, which can be stored in a computer-readable storage medium as described above. When a computer system reads and executes the code and/or data stored on the computer-readable storage medium, the computer system performs the methods and processes embodied as data structures and code and stored within the computer-readable storage medium.

The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any subject matter or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular subject matters. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment.

Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or variation of a sub-combination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous.

Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

The foregoing descriptions of embodiments of the subject matter have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit the subject matter to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the subject matter. The scope of the subject matter is defined by the appended claims.

What is claimed is:
 1. A computer-implemented method for facilitating predicting a three-dimensional structure of an amino acid sequence comprising: determining a first pair of angles indexed by a first position in the sequence and a first state, based on a first amino acid indexed by the first position in the amino acid sequence, a second amino acid indexed by a second position in the amino acid sequence, the first state, a second pair of angles indexed by a third position and a second state, a third amino acid indexed by the third position in the amino acid sequence, and the second state, wherein the first position is in proximity to the second position, wherein the first position is in proximity to the third position, and wherein the second pair of angles indexed by the third position and the second state has previously been determined by dynamic programming; and returning a result indicating the three-dimensional structure of the amino acid sequence based on the first pair of angles.
 2. The method of claim 1, wherein determining the first pair of angles is based on a multivariate Gaussian distribution comprising a mean vector and a covariance matrix.
 3. The method of claim 2, wherein the mean vector and the covariance matrix are learned from training data comprising at least a third amino acid, a fourth amino acid, a third state associated with the third amino acid, and a third pair of angles associated with the third amino acid.
 4. The method of claim 3, wherein the mean vector and the covariance matrix are learned from training data comprising one-hot representations of the third state and the third and fourth amino acids.
 5. One or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for facilitating predicting a three-dimensional structure of an amino acid sequence, comprising: determining a first pair of angles indexed by a first position in the sequence and a first state, based on a first amino acid indexed by the first position in the amino acid sequence, a second amino acid indexed by a second position in the amino acid sequence, the first state, a second pair of angles indexed by a third position and a second state, a third amino acid indexed by the third position in the amino acid sequence, and the second state, wherein the first position is in proximity to the second position, wherein the first position is in proximity to the third position, and wherein the second pair of angles indexed by the third position and the second state has previously been determined by dynamic programming; and returning a result indicating the three-dimensional structure of the amino acid sequence based on the first pair of angles.
 6. The one or more non-transitory computer-readable storage media of claim 5, wherein determining the first pair of angles is based on a multivariate Gaussian distribution comprising a mean vector and a covariance matrix.
 7. The one or more non-transitory computer-readable storage media of claim 6, wherein the mean vector and the covariance matrix are learned from training data comprising at least a third amino acid, a fourth amino acid, a third state associated with the third amino acid, and a third pair of angles associated with the third amino acid.
 8. The one or more non-transitory computer-readable storage media of claim 7, wherein the mean vector and the covariance matrix are learned from training data comprising one-hot representations of the third state and the third and fourth amino acids.
 9. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations for facilitating predicting a three-dimensional structure of an amino acid sequence, comprising: determining a first pair of angles indexed by a first position in the sequence and a first state, based on a first amino acid indexed by the first position in the amino acid sequence, a second amino acid indexed by a second position in the amino acid sequence, the first state, a second pair of angles indexed by a third position and a second state, a third amino acid indexed by the third position in the amino acid sequence, and the second state, wherein the first pair of angles is indexed by the first state and a first location, wherein the first position is in proximity to the second position, wherein the first position is in proximity to the third position, and wherein the second pair of angles indexed by the third position and the second state has previously been determined by dynamic programming; and returning a result indicating the three-dimensional structure of the amino acid sequence based on the first pair of angles.
 10. The system of claim 9, wherein determining the first pair of angles is based on a multivariate Gaussian distribution comprising a mean vector and a covariance matrix.
 11. The system of claim 10, wherein the mean vector and the covariance matrix are learned from training data comprising at least a third amino acid, a fourth amino acid, a third state associated with the third amino acid, and a third pair of angles associated with the third amino acid.
 12. The system of claim 11, wherein the mean vector and the covariance matrix are learned from training data comprising one-hot representations of the third state and the third and fourth amino acids.